Regression generally refers to predicting a real number. However, it can also be used for classification (predicting a class or category).
The term linear refers to the fact that the method model data with linear combination of explanatory variables.
linear combination is an expression where one or more variables are scaled by a constant factor and added together.
We assume that the true relationship between X and Y takes the form \(Y=f(X)+ \epsilon\), for some unknown function f, where \(\epsilon\) is a mean- zero random error term. If f is to be approximated by a linear function, then we can write this relationship as \[Y = \beta_0 + \beta_1 X + \epsilon \] Here \(\beta_0\) is the intercept term and it is the expected value of Y when X=0, and \(\beta_1\) is the slope- the average increase in Y associated with a one-unit increase in X. The above model defines the population regression line, which is the best linear approximation to the true relationship between X and Y. The least squares regression coefficient estimates characterize the least squares line, defined by \[ \hat y = \hat \beta_0 + \hat \beta_1 x \]. – least square estimates are unbiased estimations. \[ \hat \beta_1 = \frac{\sum _{i =1}^n (x_i - \bar x)(y_i - \bar y)}{\sum _{i=1}^n (x_i - \bar x)^2} \]
\[ \beta_0 = \bar y - \hat\beta_1 \bar x \]
– If we wonder how close \(\hat \beta_0\) and \(\hat \beta_1\) are to the true values \(\beta_0\) and \(\beta_1\). To compute the standard errors associated with \(\hat \beta_0\) and \(\hat \beta_1\), we use the following formulas: \[SE(\hat \beta_0 )^2 = \sigma^2 [\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i -\bar x)^2}]\] \[SE(\hat \beta_1 )^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i -\bar x)^2}.\]
Here we assume that each observation have common variance \(\sigma^2\). Notice in the formula that SE(\(\hat \beta_1\)) is smaller when the \(x_i\) are more spread out.
For linear regression the 95% confidence interval for \(\beta_1\) approximately takes the form \[\hat \beta_1 \pm 2SE(\hat \beta_1). \] Similarly, the confidence interval for \(\hat \beta_0\) approximately takes the form \[\hat \beta_0 \pm 2SE(\hat \beta_0). \]
We know that associated with each observation is an error term \(\epsilon.\) Due to these error terms, even if we know the true regression line i.e. if \(\beta_1\) and \(\beta_0\) were known, we would not be able to perfectly predict Y from X. The RSE is an estimate of the standard deviation of \(\epsilon.\) Actually, it is the average amount that the response will deviate from the true regression line. It is computed using the formula \[ RSE = \sqrt {\frac{1}{n-2}RSS} = \sqrt {\frac{1}{n-2} \sum_{i=1}^n (y_i - \hat y_i)^2}.\]
The RSE is considered as a measure of the lack of fit of the model to the data. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE. The \(R^2\) statistic provides an alternative measure of fit. It takes the form of proportion, the proportion of variance explained and is independent of the scale of Y. To calculate \(R^2\), we use the formula
\[ R^2 = \frac{TSS - RSS}{TSS}\]
where \(TSS = \sum(y_i - \bar y)^2\) is the total sum of squares. TSS measures the total variance in the response Y, and can be thought of as the amount of the variability inherited in the response before the regression is performed. RSS measures the amount of variability that is left unexplained after performing the regression. Hence, the TSS - RSS measure the amount of variability in the response that is explained by performing the regression, and \(R^2\) measures the proportion of variability in Y that can be explained by X. — In typical application in biology, marketing and other domains, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an \(R^2\) value well below 0.1 might be more realistic.
It is possible that all of the predictors are associated with the response, but it is more often the case that the response is only associated with the subset of predictors. This is called the variable selection. There are a total of \(2^p\) models that contain subsets of \(p\) variables. Therefore, trying out every possible subset of the predictors is in-feasible. There are three classical approaches for this task:
Forward selection. We begin with the null model and then to the present model the variable that results in the lowest RSS.
Backward selection. We start from all variables and then remove the variable with the largest p-value. We continue until all remaining variables have a p-value below some threshold.
Mixed selection. This is a combination of forward and backward selection. We start from no variables in the model and as with forward selection, we add the variables that provide the best fit. Of course the p-value for variables can become larger as new variable are added to the model. Hence, at any point the p-value for one variable in the model raise above a certain threshold, then we remove that variable from the model. We continue until all variables in the model have a sufficiently low p-value and all variables outside of the model would have a large p-value if added to the model.
The linear model overestimate dependent variable in the case where most of advertising money was spent exclusively on either TV or radio. and underestimate where the budget was split between the two media. This non-linear pattern suggest a synergy or interaction effect between variables, whereby combining the variables together results in a bigger boost to sale (dependent variable) than using any single variable. In this case we extend the linear model to accommodate such synergistic effect.
The inaccuracy in the coefficient estimates is related to the reducible error.
Of course assuming a linear model for \(f(X)\) is almost always an approximation of reality, so there is additional source of error which is called model bias.
Even if we know the true values for \(\beta_0, \beta_1, ..., \beta_p\), the response value can not be predicted perfectly, because of the random error \(\epsilon\) in the model. This is called irreducible error
Two of the most assumptions in linear regression states that the relationship between the predictors and response are additive and linear. The additivity assumption means that the association between a predictor and the response does not depend on the values of other variables. The linearity assumption states that the change in the response associated with a one unit change in the independent variable \(X_j\) is constant, regardless of the value of \(X_j\). Consider the linear model, \[ Y = \beta_0+ \beta_1 X_1+ \beta_2 X_2 \] As an example, let’s say you changed the value of \(X_2\) by adding 1, such that \(\tilde X_2 = X_2 + 1\), then you would have:
\[ \tilde y = \beta_0+ \beta_1 X_1+ \beta_2 \tilde X_2 = \beta_0+ \beta_1 X_1+ \beta_2 (X_2 + 1)\] \[ = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_2 = y + \beta_2 \] So you can see pretty clearly, the effect of one-unit change of \(X_2\) on \(y\) does not depend on the values of \(X1\) (additivity) and also on the values of \(X_2\)(linearity).
The simple and multiple regression coefficients can be quite different. Because in the simple regression case, the slope term represents the average increase in Y associated with a one unit increase in X ignoring other variables. By contrast, in the multiple regression the coefficient represent the average increase in Y associated with increasing one unit in X while holding other variables fixed.
— The shark attacks and ice cream sale example shows that how a variable in simple regression can be significant, while in multiple regression does not.
standardize coefficients
multi-colinearity
\(R^2\) indicates total variability captured by independent variables.
F-statistic. \[H_0:\beta_1 = \beta_2 = \beta_3 = 0\] \[H_1: \text{at least one } \beta_{i} \neq 0\] is called overall test. If F-statistic reject the Null hypothesis then go through p-values.
\[ F = \frac{\frac{(TSS - RSS)}{p}}{\frac{RSS}{n - p - 1}}\]
when there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1.
Question: Given the p-values for each predictor, why do we need to look at the overall F-statistic. If we have 100 predictors and \(H_0:\beta_1 = \beta_2 = ... =\beta_p = 0\) is true, so no variable is truly associated with the response. In this case about 5% of the p-values associated with each variable will be below 0.05 by chance. In other words, we expect to see approximately five small p-value even in the absence of any true association between the predictors and the response. However, the F-statistic does not suffer from this problem because it adjusts for the number of predictors.
The approach of using an F-statistic to test for any association between the predictors and the response works when p is relatively small, and certainly small compared to n. If p>n then there are more coefficients to estimate than observations from which to estimate them. In this case we can not even fit the multiple regression model using least squares, so the F-statistic can not be used. When p is large, some of the approaches such as forward selection can be used.
Gradient descend is used to find the best parameters for linear regression.
There are twp types of variances: stochastic and deterministic. Stochastic variance is noise. The variance of dependent/independent variable is deterministic, but the variance around the model is noise.
Different types of errors in linear regression.
sum of square error (non-deterministic error): \[Error = \frac{\sum_{i =1}^{n} (\hat{y} - y_i )^2}{n}\] The line \(\hat{y}\) which gives sum of squared errors is considered as the best line.
Total error is the difference between the actual data points and the mean value of y-values.
regression error is the difference between the predicted value and the mwan value of y-values.
\[ R^2 = 1 - \frac{RSS}{TSS}\] - Adjusted coefficient of determination. When we include some useless variables in the model, the \(R^2\) will increase, but the Adjusted coefficient of determination will decrease. Adjusted Rsquare compare two models with the same dependent variable.
Correlation of Error Terms: An important assumption of the linear regression model is that the error terms for n observations, \(\epsilon_1,\epsilon_2,...,\epsilon_n\) are uncorrelated. The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms. If there is a correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard error. As a result, confidence intervals for predictions will be narrower than they should be. In addition, p-values associated with model will be lower than what they should be. In cases where, observations are obtained at adjacent time points will have positively correlated errors. In this case, we need to plot the residuals as a function of time and the correlation between error terms for adjacent time points show the correlation of Error terms.
Non-constant Variance of Error terms: Another important assumption of the linear model is that the error terms have a constant variance, \(Var(\epsilon_i) = \sigma ^2.\) The standard errors, confidence intervals, and hypothesis tests with the linear model rely upon this assumption.
Unfortunately, it is often the case that the variance of the error terms are non-constant. For example the variance of the error terms may increase with the value of the response. Non-constant variance in the errors is called heteroscedasticity, form the presence of a funnel shape in the residual plot. In this case one solution is to transform the response Y using a concave function such as \(logY\) or \(\sqrt Y.\) Such a transformation results in a greater amount of shrinkage of the larger response leading to a reduction in heteroscedasticity.
Some time we have a good idea of the variance of each response. For example, the \(i\)th response could be an average of \(n_i\) raw observations. If each of observations is uncorrelated with variance \(\sigma^2\), then their average has variance \(\sigma_i^2 = \frac{\sigma^2}{n}.\) In this case the simple solution is to fit our model by weighted least squares, which weights proportional to the inverse variance, i.e. \(w_i = n_i\).
Residual plot can be used to identify outliers. Observations whose studentized residual are greater than 3 in absolute value are possible outliers, and to be removed.
High leverage Points: Outlier are observations for which the
response \(y_i\) is unusual, in
contrast, observations with high leverage have an unusual value for
\(X_i\). Removing the high leverage
observation has a much more impact on the least square line than
removing the outlier. The leverage
statistic for a simple linear regression is as follows. \[h_i = \frac{1}{n} + \frac{(x_i- \bar
x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}\] The larger the value of
the leverage statistic, the higher leverage. The right-handed of above
figure provides studentized residuals versus \(h_i\). Observation 41 has a very high
leverage and high studentized residuals. This is a dangerous
combination.
Collinearity: Collinearity refers to the situation in which two or more predictor variables are closely related to each other. The presence of collinearity can pose problems in the regression context, since it can be difficult to spread out the individual effect of collinear variables on the response variable. Collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard errors for \(\hat \beta_j\) to grow. As a result in the presence of the collinearity, the probability of correctly detecting of non-zero coefficients- the power of test- is reduced by collinearity. One way to assess multiple collinearity is to compute the variance inflation factor (VIF)
\[VIF(\hat \beta_j) = \frac{1}{1 - R^2_{X_j|X_{j-1}}}\] where \(R^2_{X_j|X_{j-1}}\) is the \(R^2\) from a regression of \(X_j\) onto all of the other predictors. If \(R^2_{X_j|X_{j-1}}\) is close to one, then collinearity is present and so the VIF will be large. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
When faced with the problem of collinearity, there are two simple solutions. The first is to drop one of the problematic variables from the regression and the second is to combine the collinear variables together into a single predictor. For example, we might take the average of standardized versions of collinear variables in order to create a new variable that measures both variables.
low variance filter means to ignore data columns that does not have enough variance.
Male or Female can not be replaced with 0 or 1. Since in this case, Male and Female will have order. In this case we should define two variables male and female and set the value of male column to 1 if the gender of the sample is male.
If two independent variable are highly correlated, we need to keep only one variable. In this case we can use previous knowledge and drop out the variable that has higher error in measurement or use principal component analysis to merge this two variables.
Normalization on independent variables does not affect the result of linear regression. Since it does not affect the correlation between variables.
Best fit line go through the \((\bar{x} , \bar{y})\)
The total sum of squares of the observed \(y_i\) about their average \(\bar{y}\) is the total amount of variability in the dependent variable. \[\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \sum_{i=1}^{n} (\hat {y_i} - \bar{y})^2\] This amount is partitioned into the sum of squared values of the observed values about the fitted values plus the sum of squared residuals. The sum of squred residuals is reffered to as the error sum of squares or the unexplained sum of squares.
The sum of squares about the fitted values \(\hat{y_i}\) is the amount of variability attributed ti knowing the explanatory variable \(x_i\). This is called the explained or model sum of squares. Ideally we want the model sum of squares to be large relative to the error sum of squares.
For every parameter (coefficient) in the regression model, R
provides an estimated value and a standard error of that estimate.
The
t-value is the ratio
\[\text{t-value} = \frac{\text{Estimate}}{\text{Std.Error}}\] The corresponding statistical significance of this t-statistic appears under Pr[>|t|]. This tests the null hypothesis that the true, underlying population parameter is actually zero. A small p-value for this test indicates that the strong relationship between the variables could not have happened by chance alone.
If we obtain a negative coefficient for a variable that is expected to have a positive correlation with the dependent variable, it can be attributed to the presence of multicollinearity.
In general, the optimal value for K in KNN depend on the bias-variance trade off
The following figure compares the results of KNN when K = 1 (left
figure) and K = 9 (right figure). When K increases the fitted curve
using KNN represents a smoother fit.
When the true model is linear, the least square regression line
provides a very good estimate compared with KNN. Since the true model is
linear, the best result for KNN occur with a very large value of K.
when the true model is non-linear, KNN outperforms linear
regression.The following top figure is an example of slightly non-linear
data and the bottom is a strongly non-linear. The following figure consider similar
situation where there is strongly non-linear relation. When p = 1 or p =
2, KNN outperforms the linear regression. for p = 3 the results are
mixed and for p >=4 linear regression is superior to KNN. In this
case where there are 50 observations, spreading 50 observations over 20
dimension has only a small effect on the linear regression in the MSE,
but has more than ten-fold increase in the MSE for KNN. Spreading 50
observations over p = 20 dimensions results in a phenomenon in which a
given observation has no nearby neighbor. This is called curse of
dimensionality. That is the K observations that are nearest to a
given test observation may be very far from that point in p-dimensional
space when p is large, leading to a very poor prediction of f(x) and
hence a poor KNN fit.
As a general rule, parametric methods outperform non-parametric approaches when there is a small number of observations per predictor.
Even when the dimension is small, we might prefer linear regression to KNN from an interpret ability standpoint. If the test MSE of KNN is only slightly lower than that of linear regression we prefer to use a simple model that can be described in terms of a few coefficients.
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Before fitting linear models,let’s examine the independence of our potential predictors and the dependent variable. Multiple linear regressions assume that predictors are all independent with each other.Is this assumption valid?
## cyl wt am carb
## cyl 1.0000000 0.7824958 -0.52260705 0.52698829
## wt 0.7824958 1.0000000 -0.69249526 0.42760594
## am -0.5226070 -0.6924953 1.00000000 0.05753435
## carb 0.5269883 0.4276059 0.05753435 1.00000000
If two of our predictors are highly correlated,they both provide similar information. Such multicollinearity may cause undue bias in the model. and one common practice is to remove one of the highly correlated predictors prior to fitting the model.
##
## Call:
## lm(formula = mpg ~ cyl + wt + am + carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5451 -1.2184 -0.3739 1.4699 5.3528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.8503 2.8694 12.843 5.17e-13 ***
## cyl -1.1968 0.4368 -2.740 0.0108 *
## wt -2.4785 0.9364 -2.647 0.0134 *
## am 1.7801 1.5091 1.180 0.2485
## carb -0.7480 0.3956 -1.891 0.0694 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.5 on 27 degrees of freedom
## Multiple R-squared: 0.8502, Adjusted R-squared: 0.828
## F-statistic: 38.3 on 4 and 27 DF, p-value: 9.255e-11
## fit lwr upr
## Mazda RX4 21.96404 20.373847 23.55423
## Mazda RX4 Wag 21.33203 19.755852 22.90820
## Datsun 710 27.34514 25.363462 29.32682
## Hornet 4 Drive 20.95312 19.401671 22.50457
## Hornet Sportabout 17.25384 15.325001 19.18267
## Valiant 20.34590 18.773832 21.91796
## Duster 360 15.43570 13.605856 17.26554
## Merc 240D 22.66077 20.289017 25.03253
## Merc 230 22.75991 20.396509 25.12331
## Merc 280 18.15156 16.100642 20.20248
## Merc 280C 18.15156 16.100642 20.20248
## Merc 450SE 14.94443 13.558802 16.33007
## Merc 450SL 15.78711 14.287828 17.28640
## Merc 450SLC 15.66319 14.198227 17.12815
## Cadillac Fleetwood 11.27187 8.793990 13.74976
## Lincoln Continental 10.84062 8.080221 13.60102
## Chrysler Imperial 11.03642 8.405793 13.66705
## Fiat 128 27.64256 25.745837 29.53928
## Honda Civic 28.34449 26.572512 30.11648
## Toyota Corolla 28.54720 26.744770 30.34963
## Toyota Corona 25.20563 22.912462 27.49880
## Dodge Challenger 17.05556 15.188387 18.92273
## AMC Javelin 17.26623 15.333199 19.19926
## Camaro Z28 14.76651 13.219993 16.31303
## Pontiac Firebird 16.25006 14.511881 17.98823
## Fiat X1-9 28.29935 26.497549 30.10116
## Porsche 914-2 27.04330 25.461838 28.62476
## Lotus Europa 28.59730 26.726437 30.46816
## Ford Pantera L 18.20722 15.826731 20.58771
## Ferrari Dino 20.09633 17.630570 22.56209
## Maserati Bora 14.22396 10.956817 17.49110
## Volvo 142E 25.45708 23.341815 27.57234
## cyl wt am carb
## 3.018795 4.164539 2.813441 2.025210
The following figure shows that how different directions in data reduction would affect the result of classification.